The idea

In my previous blog post, we got to know the idea of "indentation-based complexity". We took a static view on the Linux kernel to spot the most complex areas.

This time, we want to track the evolution of the indentation-based complexity of a software system over time. We are especially interested in its correlation with the lines of code: if the number of lines of code in our system stays more or less stable while the amount of indentation per source code file keeps increasing, we most likely have a complexity problem.

Again, this analysis is highly inspired by Adam Tornhill's book "Software Design X-Rays", which I currently recommend to anyone who wants a deep dive into software data analysis.

The data

To calculate the evolution of our software system, we can use data from the version control system. In our case, we can get all changes to Java source code files with Git. We just need to say the right magic words, which are

git log -p -- *.java

This gives us data like the following:

commit e5254156eca3a8461fa758f17dc5fae27e738ab5
Author: Antoine Rey <antoine.rey@gmail.com>
Date:   Fri Aug 19 18:54:56 2016 +0200

    Convert Controler's integration test to unit test

diff --git a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
index ee83b8a..a83255b 100644
--- a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
+++ b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
@@ -1,8 +1,5 @@
 package org.springframework.samples.petclinic.web;

-import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
-import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.*;
-
 import org.junit.Before;
 import org.junit.Test;
 import org.junit.runner.RunWith;

We have the

commit sha
    commit e5254156eca3a8461fa758f17dc5fae27e738ab5
author's name
    Author: Antoine Rey <antoine.rey@gmail.com>
date of the commit
    Date: Fri Aug 19 18:54:56 2016 +0200
commit message
    Convert Controler's integration test to unit test
names of the changed file (before and after)
    diff --git a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
the extended index header
    index ee83b8a..a83255b 100644
    --- a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
    +++ b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java

and the full file diff where we can see additions or modifications (+) and deletions (-)

     package org.springframework.samples.petclinic.web;

    -import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
    -import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.*;
    -
     import org.junit.Before;

We "just" have to get this data into our favorite data analysis framework, which is, of course, Pandas :-). We can actually do that! Let's see how!
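To get the log into a file that pandas can read, the Git command from above could be wrapped in a small helper. This is only a sketch: the function names and the default output file name git_diff.log are assumptions made for illustration; only the git log -p -- *.java call itself comes from the text above.

```python
import subprocess
from pathlib import Path


def build_git_log_command(pathspec="*.java"):
    """Return the argument list for the `git log -p` call shown above."""
    # The pathspec is passed as its own list item and no shell is involved,
    # so Git interprets the pattern instead of the shell expanding it.
    return ["git", "log", "-p", "--", pathspec]


def export_git_log(repo_dir, out_file="git_diff.log", pathspec="*.java"):
    """Run the Git command in repo_dir and write its output to out_file.

    Hypothetical helper; file name and encoding handling are assumptions.
    """
    result = subprocess.run(
        build_git_log_command(pathspec),
        cwd=repo_dir,
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # do not fail on oddly encoded source files
        check=True,
    )
    out_path = Path(out_file)
    out_path.write_text(result.stdout, encoding="utf-8")
    return out_path
```

Calling export_git_log with the path of a local clone would then produce the text file that is read in the next section.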

Advanced data wrangling

Reading in such semi-structured data is a bit of a challenge, but we can do it with some tricks. First, we read in the whole Git diff history by standard means, using read_csv with the separator \n to get one row per line. We also give the single column a descriptive name.



In [19]:

    
import pandas as pd

diff_raw = pd.read_csv(
    "../../buschmais-spring-petclinic_fork/git_diff.log",
    sep="\n",
    names=["raw"])
diff_raw.head(5)









    Out[19]:

	raw
0	commit e5254156eca3a8461fa758f17dc5fae27e738ab5
1	Author: Antoine Rey <antoine.rey@gmail.com>
2	Date:   Fri Aug 19 18:54:56 2016 +0200
3	Convert Controler's integration test to un...
4	diff --git a/src/test/java/org/springframework...



In [20]:

    
diff_raw[diff_raw.raw.str.startswith("commit")].head()









    Out[20]:

	raw
0	commit e5254156eca3a8461fa758f17dc5fae27e738ab5
288	commit 1a6572d1ac0c7659d9243405074f3f19f9a93328
409	commit 078bdc6cfbada99413041d859fdeb3d63685c40b
446	commit 4c722465d8091cfa29b54674138c24c199259172
525	commit ca755be44a2844e10876b8e38e68b9adf736b486
The output is the commit data that I've described above, where each line in the text file represents one row in the DataFrame (without blank lines).
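Some newer pandas versions refuse "\n" as a separator in read_csv. In that case, the same single-column DataFrame can be built by reading the file line by line. The following is a minimal sketch, not the approach used in this post; the helper name lines_to_frame is made up for this example.

```python
import pandas as pd


def lines_to_frame(lines):
    """Build a one-column DataFrame ("raw") from an iterable of text lines."""
    # Strip only the line ending so that leading indentation survives
    cleaned = (line.rstrip("\r\n") for line in lines)
    # The DataFrame in the text contains no blank lines, so drop them here too
    return pd.DataFrame({"raw": [line for line in cleaned if line.strip()]})


# Usage (path taken from the read_csv call above, UTF-8 is an assumption):
# with open("../../buschmais-spring-petclinic_fork/git_diff.log",
#           encoding="utf-8", errors="replace") as log_file:
#     diff_raw = lines_to_frame(log_file)
```

Because the index is created after the blank lines have been dropped, the row numbers should line up with the ones shown above, but that is worth checking against your own data.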

Cleansing

We skip all the data that we definitely don't need. In particular, the two lines after the "extended index header" that begin with --- and +++ could easily be confused with the real diff data, which also begins with a + or a -. Fortunately, we can identify these rows easily: they are the two rows that directly follow the row starting with index. Using the shift operation on the rows that start with index, we can get rid of all those lines.



In [21]:

    
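index_row = diff_raw.raw.str.startswith("index ")
# The two rows after each "index" row are the "---"/"+++" file headers;
# fill_value=False keeps the shifted masks boolean.
ignored_diff_rows = index_row.shift(1, fill_value=False) | index_row.shift(2, fill_value=False)
diff_raw = diff_raw[~(index_row | ignored_diff_rows)].copy()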
diff_raw.head(10)









    Out[21]:

	raw
0	commit e5254156eca3a8461fa758f17dc5fae27e738ab5
1	Author: Antoine Rey <antoine.rey@gmail.com>
2	Date:   Fri Aug 19 18:54:56 2016 +0200
3	Convert Controler's integration test to un...
4	diff --git a/src/test/java/org/springframework...
8	@@ -1,8 +1,5 @@
9	package org.springframework.samples.petclinic...
10	-import static org.springframework.test.web.se...
11	-import static org.springframework.test.web.se...
12	-
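How the shift trick works is easier to see on a tiny example. The sketch below uses a handful of made-up log lines (not taken from the real repository) and marks the index row plus the two rows after it.

```python
import pandas as pd

# Made-up sample lines that mimic the structure of the Git log above
sample = pd.DataFrame({"raw": [
    "diff --git a/Foo.java b/Foo.java",
    "index ee83b8a..a83255b 100644",
    "--- a/Foo.java",
    "+++ b/Foo.java",
    "@@ -1,3 +1,2 @@",
    "-import java.util.List;",
    "+import java.util.Map;",
]})

index_row = sample.raw.str.startswith("index ")
# shift(1) marks the row after each index row, shift(2) the one after that
after_index = index_row.shift(1, fill_value=False) | index_row.shift(2, fill_value=False)
cleaned = sample[~(index_row | after_index)]

print(cleaned.raw.tolist())
```

Only the diff --git line, the hunk header and the two real change lines remain, and the original row numbers are kept, which explains the jump from 4 to 8 in the output above.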

Extracting metadata

Next, we extract some metadata for each commit. We can identify the different entries by using a regular expression that looks for a specific keyword at the start of each line. We extract each piece of information into a new Series/column because we need it for every changed line throughout the software's history.



In [22]:

    
diff_raw['commit'] = diff_raw.raw.str.split("^commit ").str[1]
diff_raw['timestamp'] = pd.to_datetime(diff_raw.raw.str.split("^Date: ").str[1])
diff_raw['path'] = diff_raw.raw.str.extract("^diff --git.* b/(.*)", expand=True)[0]
diff_raw.head()









    Out[22]:

	raw	commit	timestamp	path
0	commit e5254156eca3a8461fa758f17dc5fae27e738ab5	e5254156eca3a8461fa758f17dc5fae27e738ab5	NaT	NaN
1	Author: Antoine Rey <antoine.rey@gmail.com>	NaN	NaT	NaN
2	Date:   Fri Aug 19 18:54:56 2016 +0200	NaN	2016-08-19 16:54:56	NaN
3	Convert Controler's integration test to un...	NaN	NaT	NaN
4	diff --git a/src/test/java/org/springframework...	NaN	NaT	src/test/java/org/springframework/samples/petc...

To assign each commit's metadata to the remaining rows, we forward fill those rows with the metadata (the ffill strategy in pandas).



In [23]:

    
# Forward fill the metadata; fillna(method='ffill') is deprecated in recent pandas versions
diff_raw = diff_raw.ffill()
diff_raw.head(8)









    Out[23]:

	raw	commit	timestamp	path
0	commit e5254156eca3a8461fa758f17dc5fae27e738ab5	e5254156eca3a8461fa758f17dc5fae27e738ab5	NaT	NaN
1	Author: Antoine Rey <antoine.rey@gmail.com>	e5254156eca3a8461fa758f17dc5fae27e738ab5	NaT	NaN
2	Date:   Fri Aug 19 18:54:56 2016 +0200	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	NaN
3	Convert Controler's integration test to un...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	NaN
4	diff --git a/src/test/java/org/springframework...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...
8	@@ -1,8 +1,5 @@	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...
9	package org.springframework.samples.petclinic...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...
10	-import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...
Identifying source code lines

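We can now focus on the changed source code lines. We can identify the indentation of each line by comparing its length with and without leading whitespace, skipping the first character because it holds the diff marker (+, - or a space).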



In [24]:

    
diff_raw["i"] = diff_raw.raw.str[1:].str.len() - diff_raw.raw.str[1:].str.lstrip().str.len()
diff_raw.head()









    Out[24]:

	raw	commit	timestamp	path	i
0	commit e5254156eca3a8461fa758f17dc5fae27e738ab5	e5254156eca3a8461fa758f17dc5fae27e738ab5	NaT	NaN	0
1	Author: Antoine Rey <antoine.rey@gmail.com>	e5254156eca3a8461fa758f17dc5fae27e738ab5	NaT	NaN	0
2	Date:   Fri Aug 19 18:54:56 2016 +0200	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	NaN	0
3	Convert Controler's integration test to un...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	NaN	3
4	diff --git a/src/test/java/org/springframework...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0
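For a single line, the same idea can be written in plain Python. This is a sketch of a variant, not the exact expression from above: it expands tabs before counting, and the tab width of four as well as the function name are assumptions.

```python
def indentation(diff_line, tab_width=4):
    """Count the leading spaces of a diff line, ignoring the marker character."""
    # The first character is the diff marker ("+", "-" or " "), so skip it.
    # Tabs are expanded first so that one tab counts as tab_width spaces.
    content = diff_line[1:].replace("\t", " " * tab_width)
    return len(content) - len(content.lstrip(" "))


print(indentation("+        return owner;"))
print(indentation("-\tcrash();"))
```

The first call counts eight leading spaces, the second one counts the single tab as four.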



In [25]:

    
%%timeit
# Count the leading spaces after the "+" (added) or "-" (deleted) diff marker
diff_raw['added'] = diff_raw.raw.str.extract(r"^\+( *).*$", expand=True)[0].str.len()
diff_raw['deleted'] = diff_raw.raw.str.extract(r"^-( *).*$", expand=True)[0].str.len()
diff_raw.head()









    



91.6 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
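What the two regular expressions from the timed cell produce can be checked on a few made-up diff lines. In this sketch, rows that do not start with the respective marker end up as NaN, which is what later separates added from deleted lines.

```python
import pandas as pd

# Made-up diff lines: one added, one deleted, one context line, one empty addition
sample = pd.Series([
    "+    int a = 1;",
    "-        return a;",
    " }",
    "+",
])

# The capture group holds the run of spaces directly after the marker
added = sample.str.extract(r"^\+( *).*$", expand=False).str.len()
deleted = sample.str.extract(r"^-( *).*$", expand=False).str.len()

print(pd.DataFrame({"raw": sample, "added": added, "deleted": deleted}))
```

Note that the pattern only counts spaces, so a line indented with tabs would get an indentation of zero here.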

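For our later indentation-based complexity calculation, we have to make sure that each line uses the same kind of whitespace for indentation. Therefore, we replace every tab with four spaces.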



In [26]:

    
diff_raw['line'] = diff_raw.raw.str.replace("\t", "    ")
diff_raw.head()









    Out[26]:

	raw	commit	timestamp	path	i	added	deleted	line
0	commit e5254156eca3a8461fa758f17dc5fae27e738ab5	e5254156eca3a8461fa758f17dc5fae27e738ab5	NaT	NaN	0	NaN	NaN	commit e5254156eca3a8461fa758f17dc5fae27e738ab5
1	Author: Antoine Rey <antoine.rey@gmail.com>	e5254156eca3a8461fa758f17dc5fae27e738ab5	NaT	NaN	0	NaN	NaN	Author: Antoine Rey <antoine.rey@gmail.com>
2	Date:   Fri Aug 19 18:54:56 2016 +0200	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	NaN	0	NaN	NaN	Date:   Fri Aug 19 18:54:56 2016 +0200
3	Convert Controler's integration test to un...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	NaN	3	NaN	NaN	Convert Controler's integration test to un...
4	diff --git a/src/test/java/org/springframework...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	NaN	NaN	diff --git a/src/test/java/org/springframework...



In [27]:

    
diff = \
    diff_raw[
        (~diff_raw['added'].isnull()) | 
             (~diff_raw['deleted'].isnull())].copy()
diff.head()









    Out[27]:

	raw	commit	timestamp	path	i	added	deleted	line
10	-import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	NaN	0.0	-import static org.springframework.test.web.se...
11	-import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	NaN	0.0	-import static org.springframework.test.web.se...
12	-	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	NaN	0.0	-
19	+import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	0.0	NaN	+import static org.springframework.test.web.se...
20	+import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	0.0	NaN	+import static org.springframework.test.web.se...



In [28]:

    
# A comment line starts (after optional spaces) with //, /* or a Javadoc-style *
diff['is_comment'] = diff.line.str[1:].str.match(r' *(//|/\*|\*).*')
diff['is_empty'] = diff.line.str[1:].str.replace(" ","").str.len() == 0
diff['is_source'] = ~(diff['is_empty'] | diff['is_comment'])
diff.head()









    Out[28]:

	raw	commit	timestamp	path	i	added	deleted	line	is_comment	is_empty	is_source
10	-import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	NaN	0.0	-import static org.springframework.test.web.se...	False	False	True
11	-import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	NaN	0.0	-import static org.springframework.test.web.se...	False	False	True
12	-	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	NaN	0.0	-	False	True	False
19	+import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	0.0	NaN	+import static org.springframework.test.web.se...	False	False	True
20	+import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	0.0	NaN	+import static org.springframework.test.web.se...	False	False	True
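The three-way split into comment, empty and source lines can also be expressed as a small function for a single line. This is a sketch under two assumptions: tabs have already been replaced by spaces, and any line that starts with //, /* or * counts as a comment. It is a heuristic, so a source line that happens to start with * would be misclassified; the names are made up for this example.

```python
import re

# Optional spaces followed by "//", "/*" or a Javadoc-style "*"
COMMENT_START = re.compile(r" *(//|/\*|\*)")


def classify(diff_line):
    """Return "empty", "comment" or "source" for one changed diff line."""
    content = diff_line[1:]  # skip the "+" or "-" marker
    if content.strip(" ") == "":
        return "empty"
    if COMMENT_START.match(content):
        return "comment"
    return "source"


for line in ["+    // TODO", "-     * @param owner", "+", "+    int a = 1;"]:
    print(repr(line), classify(line))
```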



In [29]:

    
diff.raw.str[0].value_counts()









    Out[29]:





+    11483
-     7525
Name: raw, dtype: int64



In [30]:

    
diff['lines_added'] = (~diff.added.isnull()).astype('int')
diff['lines_deleted'] = (~diff.deleted.isnull()).astype('int')
diff.head()









    Out[30]:

	raw	commit	timestamp	path	i	added	deleted	line	is_comment	is_empty	is_source	lines_added	lines_deleted
10	-import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	NaN	0.0	-import static org.springframework.test.web.se...	False	False	True	0	1
11	-import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	NaN	0.0	-import static org.springframework.test.web.se...	False	False	True	0	1
12	-	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	NaN	0.0	-	False	True	False	0	1
19	+import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	0.0	NaN	+import static org.springframework.test.web.se...	False	False	True	1	0
20	+import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	0.0	NaN	+import static org.springframework.test.web.se...	False	False	True	1	0



In [31]:

    
diff = diff.fillna(0)
#diff.to_excel("temp.xlsx")
diff.head()









    Out[31]:

	raw	commit	timestamp	path	i	added	deleted	line	is_comment	is_empty	is_source	lines_added	lines_deleted
10	-import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	0.0	0.0	-import static org.springframework.test.web.se...	False	False	True	0	1
11	-import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	0.0	0.0	-import static org.springframework.test.web.se...	False	False	True	0	1
12	-	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	0.0	0.0	-	False	True	False	0	1
19	+import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	0.0	0.0	+import static org.springframework.test.web.se...	False	False	True	1	0
20	+import static org.springframework.test.web.se...	e5254156eca3a8461fa758f17dc5fae27e738ab5	2016-08-19 16:54:56	src/test/java/org/springframework/samples/petc...	0	0.0	0.0	+import static org.springframework.test.web.se...	False	False	True	1	0



In [32]:

    
commits_per_day = diff.set_index('timestamp').resample("D").sum()
commits_per_day.head()









    Out[32]:

timestamp	i	added	deleted	is_comment	is_empty	is_source	lines_added	lines_deleted
2009-05-05	164	0.0	0.0	0.0	87.0	194.0	277	4
2009-05-06	128	0.0	0.0	2.0	42.0	135.0	136	43
2009-05-07	165	0.0	0.0	1.0	60.0	138.0	189	10
2009-05-08	0	0.0	0.0	0.0	0.0	0.0	0	0
2009-05-09	0	0.0	0.0	0.0	0.0	0.0	0	0
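The daily aggregation and the running total used for the plots can be reproduced on a tiny, made-up change log. The sketch below shows that resample("D") also creates rows for days without any changes, and how the cumulative sum turns daily differences into a line count over time.

```python
import pandas as pd

# Made-up change rows: two added lines on one day, one deleted line two days later
changes = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2009-05-05 10:00", "2009-05-05 15:30", "2009-05-07 09:00"]),
    "lines_added": [1, 1, 0],
    "lines_deleted": [0, 0, 1],
})

# One row per calendar day; days without changes get a sum of zero
per_day = changes.set_index("timestamp").resample("D").sum()

# Running total of added minus deleted lines
net_lines = (per_day.lines_added - per_day.lines_deleted).cumsum()
print(net_lines)
```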



In [33]:

    
%matplotlib inline
commits_per_day.cumsum().plot()









    Out[33]:





<matplotlib.axes._subplots.AxesSubplot at 0x15bd2fe7b00>



In [34]:

    
(commits_per_day.added - commits_per_day.deleted).cumsum().plot()









    Out[34]:





<matplotlib.axes._subplots.AxesSubplot at 0x15bd3287128>



In [35]:

    
(commits_per_day.lines_added - commits_per_day.lines_deleted).cumsum().plot()









    Out[35]:





<matplotlib.axes._subplots.AxesSubplot at 0x15bd33ce198>
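To relate the two curves to each other, as motivated at the beginning, one option is to divide the accumulated indentation by the accumulated number of lines. The following sketch uses invented daily numbers, and the ratio itself is an assumption about how the comparison could be done, not a result of this analysis.

```python
import pandas as pd

# Invented daily sums with the same column names as in the aggregation above
per_day = pd.DataFrame(
    {
        "added": [40, 0, 30],        # indentation of added lines
        "deleted": [0, 8, 4],        # indentation of deleted lines
        "lines_added": [10, 0, 5],
        "lines_deleted": [0, 2, 1],
    },
    index=pd.date_range("2009-05-05", periods=3, freq="D"),
)

total_indentation = (per_day.added - per_day.deleted).cumsum()
total_lines = (per_day.lines_added - per_day.lines_deleted).cumsum()

# Average indentation per line; undefined while total_lines is zero
indentation_per_line = total_indentation / total_lines
print(indentation_per_line)
```

A rising ratio while the line count stays flat would be the warning sign described in the introduction.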



In [36]:

    
diff_sum = diff.sum()
diff_sum.lines_added - diff_sum.lines_deleted









    Out[36]:





3958



In [37]:

    
3913









    Out[37]:





3913
